This report explores a dataset containing Test of English for International Communication (TOEIC) results for approximately 2,000 enrollments at the Faculty of Business and Economics of the University of Chile. It also contains their results on the Chilean higher education selection exam (PSU), as well as other attributes related to their high schools.
## [1] 1956 10
## 'data.frame': 1956 obs. of 10 variables:
## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 2 1 1 2 2 2 2 2 ...
## $ hs.type : Factor w/ 3 levels "private","public",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hs.location: Factor w/ 2 levels "capital_city",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ hs.score : int 744 744 682 744 682 723 785 682 702 744 ...
## $ math : int 688 718 774 756 706 774 850 737 673 850 ...
## $ history : int NA 694 708 694 735 660 772 757 NA NA ...
## $ science : int 708 NA 638 NA 708 614 642 NA 672 727 ...
## $ spanish : int 743 776 710 732 801 721 700 754 682 792 ...
## $ toeic : num 677 673 667 667 667 660 657 657 653 653 ...
## year gender hs.type hs.location
## Min. :2005 female: 791 private :1017 capital_city:1562
## 1st Qu.:2006 male :1165 public : 388 other_region: 394
## Median :2008 semiprivate: 551
## Mean :2008
## 3rd Qu.:2009
## Max. :2010
##
## hs.score math history science
## Min. :435.0 Min. :580.0 Min. :432.0 Min. :166.0
## 1st Qu.:661.0 1st Qu.:694.0 1st Qu.:630.0 1st Qu.:606.0
## Median :702.0 Median :719.0 Median :677.0 Median :645.5
## Mean :694.8 Mean :726.5 Mean :678.6 Mean :642.7
## 3rd Qu.:744.0 3rd Qu.:756.0 3rd Qu.:726.0 3rd Qu.:680.0
## Max. :826.0 Max. :850.0 Max. :850.0 Max. :824.0
## NA's :462 NA's :716
## spanish toeic
## Min. :440.0 Min. :217.0
## 1st Qu.:639.0 1st Qu.:407.0
## Median :674.0 Median :467.0
## Mean :678.7 Mean :461.6
## 3rd Qu.:717.0 3rd Qu.:528.5
## Max. :831.0 Max. :677.0
##
Our dataset consists of ten variables, with almost 2,000 observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217.0 407.0 467.0 461.6 528.5 677.0
The TOEIC distribution appears to be unimodal, with scores peaking around 475. It’s important to note that the scale of this test goes from 10 to 990, so the mean (461.6) is quite low.
Regarding the categorical variables, we can see that most enrollments are male and come from private high schools located in the capital city, Santiago.
It seems like hs.score has a semi-discrete distribution shape. This makes sense given that it is the result of a transformation from a different scale of scores (from 1 to 7). This is done by the National Education Ministry to make high school scores easier to compare with the higher education selection exam’s results.
It is important to say that for the high school score, as well as for the national math, history, science and Spanish tests, the scale of scores goes from 350 to 850.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 580.0 694.0 719.0 726.5 756.0 850.0
It’s an interesting distribution: the left tail looks like a continuous variable, but the closer it gets to the maximum score (850), the more discretely it behaves. This makes sense given that the test is only 70 questions long and each incorrect answer costs relatively more points when there are few of them.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 432.0 630.0 677.0 678.6 726.0 850.0 462
This distribution seems more “normal”, given that the median and mean (677 and 678.6) are further from the maximum test score (850) than in the math scores.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 166.0 606.0 645.5 642.7 680.0 824.0 716
It looks quite unimodal, with a peak around 650. Just as before, the further from the maximum score, the more continuous and normal the distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 440.0 639.0 674.0 678.7 717.0 831.0
This case is similar to the math distribution.
It seems that most test results have a similar distribution shape, peaking around 650 points, the exception being math, where the peak is around 725 points. This makes sense, given that mathematics has a considerably higher relative weight when enrolling in the Faculty.
Having said that, at this point I should mention that the net enrollment score is calculated with the following equation:
\(\text{enrollment.score} = 0.5 \cdot \text{math} + 0.2 \cdot \text{hs.score} + 0.2 \cdot \max(\text{history}, \text{science}) + 0.1 \cdot \text{spanish}\)
I am interested in that variable as well, so I will include it in the rest of the analysis. Note that the TOEIC test does not take part in this equation, as it is not a requirement and is only taken to assess already enrolled students.
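As a minimal sketch of how this equation can be evaluated in R, the snippet below applies it to the first five rows printed by `str()` above. The object names `first_rows` and `best_elective` are illustrative assumptions, not names taken from the original analysis, and `pmax(..., na.rm = TRUE)` is one way (assumed here) to pick the higher of history and science when only one of them was taken.

```r
# First five rows, copied from the str() output above.
# 'first_rows' is an illustrative name, not part of the original analysis.
first_rows <- data.frame(
  hs.score = c(744, 744, 682, 744, 682),
  math     = c(688, 718, 774, 756, 706),
  history  = c(NA, 694, 708, 694, 735),
  science  = c(708, NA, 638, NA, 708),
  spanish  = c(743, 776, 710, 732, 801)
)

# Row-wise maximum of the two elective tests; na.rm = TRUE ignores
# the test that was not taken.
best_elective <- pmax(first_rows$history, first_rows$science, na.rm = TRUE)

# Weighted sum from the enrollment equation.
first_rows$enrollment.score <- with(
  first_rows,
  0.5 * math + 0.2 * hs.score + 0.2 * best_elective + 0.1 * spanish
)

print(first_rows$enrollment.score)
```

For the first row this gives 0.5 × 688 + 0.2 × 744 + 0.2 × 708 + 0.1 × 743 = 708.7.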
There are 1,956 enrollments in the dataset with 10 variables (year, gender, hs.type, hs.location, hs.score, math, history, science, spanish and toeic).
Other observations:
The main features in the data set are toeic and spanish. I’d like to determine which features are best for predicting results on the TOEIC test taken by new enrollments. I suspect the spanish test score and some combination of the other variables can be used to build a predictive model for TOEIC results.
Hs.location and hs.score likely contribute to the level of English of a recent graduate. I think hs.type (private, public or semiprivate) and enrollment.score probably contribute most to the TOEIC results, as they could reflect a level of self-confidence or higher general knowledge when completing the test.
I created a variable for the final enrollment score (enrollment.score) using the other tests’ scores and their corresponding relative weights. This arose in the univariate section of my analysis, when I realised that the TOEIC test is taken after the recent graduates are already enrolled and know their final enrollment score, which could play a self-confidence role when completing the TOEIC test.
The only unusual thing was how the continuous variables tended to behave in a discrete way when approaching the highest possible value, although this makes sense given that the test results are constructed to separate the whole population.
In the enrollment process, only the higher of the history and science scores is considered when calculating the net enrollment score. This is why there were so many 0 values in these two tests. I transformed these values to NAs.
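One way to do this recoding, sketched on a small made-up data frame: `raw` and `elective_cols` are hypothetical names, and the three rows only mimic how the scores might have looked before the zeros were replaced.

```r
# Hypothetical raw scores in which 0 marks a test that was not taken.
raw <- data.frame(
  history = c(0, 694, 708),
  science = c(708, 0, 638)
)

# Replace every 0 in the two elective columns with NA.
# %in% is used instead of == so that existing NAs are left untouched.
elective_cols <- c("history", "science")
raw[elective_cols] <- lapply(
  raw[elective_cols],
  function(x) replace(x, x %in% 0, NA)
)

print(raw)
```

Treating the zeros as NA keeps them out of means, medians and correlations, where they would otherwise pull the estimates down.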
## hs.score math history science spanish toeic
## hs.score 1.00 -0.39 -0.06 0.00 0.07 0.03
## math -0.39 1.00 0.00 0.13 0.02 0.07
## history -0.06 0.00 1.00 0.01 0.40 0.15
## science 0.00 0.13 0.01 1.00 0.21 0.11
## spanish 0.07 0.02 0.40 0.21 1.00 0.25
## toeic 0.03 0.07 0.15 0.11 0.25 1.00
## enrollment.score 0.09 0.73 0.43 0.36 0.43 0.20
## enrollment.score
## hs.score 0.09
## math 0.73
## history 0.43
## science 0.36
## spanish 0.43
## toeic 0.20
## enrollment.score 1.00
Of all the scores, toeic correlates most with spanish (r = 0.25), which was my suspicion, although the correlation is weak in absolute terms. To my surprise, hs.score is essentially uncorrelated with toeic (r = 0.03).
Spanish correlates moderately with history (r = 0.40), which makes sense. It also correlates with science, in a weaker way (r = 0.21).
There’s a moderate negative correlation between math and hs.score (r = -0.39), which is weird.
Math and science do not seem to have strong correlations with toeic (r = 0.07 and r = 0.11).
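A correlation matrix of this kind can be produced with `cor()`. The sketch below uses only the first ten rows shown in the `str()` output, so its values will not match the matrix above; it is meant to show the mechanics. The choice of `use = "pairwise.complete.obs"` is an assumption about how the missing history and science scores were handled, and `first_ten` is an illustrative name.

```r
# First ten rows of four score columns, copied from the str() output.
first_ten <- data.frame(
  history = c(NA, 694, 708, 694, 735, 660, 772, 757, NA, NA),
  science = c(708, NA, 638, NA, 708, 614, 642, NA, 672, 727),
  spanish = c(743, 776, 710, 732, 801, 721, 700, 754, 682, 792),
  toeic   = c(677, 673, 667, 667, 667, 660, 657, 657, 653, 653)
)

# Each pairwise correlation uses the rows where both variables are present
# (an assumed choice; the original call is not shown in the report).
cor_mat <- round(cor(first_ten, use = "pairwise.complete.obs"), 2)

print(cor_mat)
```

With pairwise deletion, each cell may be based on a different number of rows, which is worth keeping in mind when comparing coefficients that involve history or science with those that do not.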
I want to take a closer look at scatter plots involving toeic and some other continuous variables like spanish, hs.score, history and enrollment.score.
There seems to be a lot of noise, but there’s definitely a positive relationship between spanish and toeic scores.
Nope. Even after adding jitter, transparency, and changing the size of the points, there doesn’t seem to be any relation between toeic and hs.score.
There is definitely a positive correlation, but the slope is not as steep as with spanish.
This one was harder to see, so on top of the jitter, size and transparency, I added a smoothing line (linear model) to bring out the relation between toeic and enrollment.score.
Before I move on, I want to take another look at the relationship between hs.score and math.
There is a clear negative correlation. After thinking about this for a while, I guess it makes sense considering the enrollment criteria (formula).
Next, I’ll have a closer look at how the categorical features vary with toeic.
## enrollments$hs.location: capital_city
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217.0 410.0 470.0 463.7 533.0 673.0
## --------------------------------------------------------
## enrollments$hs.location: other_region
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 237.0 393.0 457.0 453.3 519.8 677.0
It seems like enrollments from high schools located in the capital city have slightly higher scores than those from other regions. There is a 13-point difference between the median values of the two groups (470 vs 457), which is not as large as I was expecting.
## enrollments$hs.type: private
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 223.0 443.0 500.0 493.8 551.0 677.0
## --------------------------------------------------------
## enrollments$hs.type: public
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217.0 380.0 433.0 431.4 487.0 620.0
## --------------------------------------------------------
## enrollments$hs.type: semiprivate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217.0 360.0 430.0 423.4 490.0 630.0
There is a difference of 67 points in the median toeic value between private and public high schools, which is quite considerable.
## enrollments$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217.0 397.0 453.0 451.9 513.0 677.0
## --------------------------------------------------------
## enrollments$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 217.0 413.0 477.0 468.2 537.0 667.0
It seems like scores are slightly higher for male than for female enrollments, with a 24-point difference in the median value between the groups.
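The median gaps in this section can be checked directly from the grouped summaries printed above. The sketch below copies the medians from that output; `median_toeic` and `gap` are illustrative names.

```r
# Median TOEIC per group, copied from the grouped summaries above.
median_toeic <- list(
  hs.location = c(capital_city = 470, other_region = 457),
  hs.type     = c(private = 500, public = 433, semiprivate = 430),
  gender      = c(female = 453, male = 477)
)

# Gap between the highest and lowest group median for each variable.
gap <- sapply(median_toeic, function(m) max(m) - min(m))
print(gap)

# Private vs public specifically.
private_vs_public <- median_toeic$hs.type[["private"]] -
  median_toeic$hs.type[["public"]]
print(private_vs_public)
```

The gaps are 13 points for hs.location, 24 for gender, and 70 for hs.type (private vs semiprivate), with 67 points between private and public, so hs.type separates the groups far more than the other two variables.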
Toeic correlates most with spanish (r = 0.25) and slightly with enrollment.score (r = 0.20) and history (r = 0.15). Also, hs.type has a considerable effect on toeic results.
Gender and hs.location have a smaller but noticeable impact on toeic results. On the other hand, hs.score, math, and science do not have strong correlations with toeic.
Spanish correlates moderately with history (r = 0.40) and, to a lesser degree, with science (r = 0.21).
Math and hs.score have a moderate negative correlation (r = -0.39). This could be explained by the fact that these scores carry the two largest coefficients in the net enrollment score equation. Also, considering that most of the data sits at the very top of the tests’ scale, it is likely that a given enrollment has either a high math score or a high high-school score, but not both.
Enrollments’ toeic results are positively correlated with spanish results, the strongest relationship found for toeic (r = 0.25). More weakly, toeic also correlates with enrollment.score (r = 0.20) and history (r = 0.15).
Given the correlation between toeic and spanish, I created a ratio between these two, in order to establish a “general linguistic” measurement. Then I wanted to see how this ratio is distributed across the three categorical variables.
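A minimal sketch of this ratio, computed on the first five rows from the `str()` output. The column name `toeic.spanish.ratio` and the object `first_rows_ratio` are assumptions for illustration; the name used in the original analysis is not shown.

```r
# First five toeic and spanish scores, copied from the str() output.
first_rows_ratio <- data.frame(
  toeic   = c(677, 673, 667, 667, 667),
  spanish = c(743, 776, 710, 732, 801)
)

# Ratio of the English score to the Spanish score (hypothetical column name).
first_rows_ratio$toeic.spanish.ratio <-
  first_rows_ratio$toeic / first_rows_ratio$spanish

print(round(first_rows_ratio$toeic.spanish.ratio, 3))
```

On the full data, a call such as `boxplot(toeic.spanish.ratio ~ hs.type, data = enrollments)` would be one way to compare the ratio across groups, assuming the column has been added to `enrollments`. Since the two tests use different scales (10 to 990 versus 350 to 850), the ratio is only meaningful for comparing groups, not as an absolute measure.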
Now let’s see how these variables affect the relation between toeic and spanish, in order to try and build a predictive model.
There is a trend of higher toeic results for private high school enrollments, although this trend is not very clear when looking at high schools from other regions.
There is a small trend toward higher scores for male enrollments, although not as strong as with hs.type.
These plots suggest that we can build a linear model and use those variables to predict the enrollments’ TOEIC results.
# mtable() comes from the memisc package
library(memisc)

# Fit a sequence of nested linear models, adding one predictor at a time
m1 <- lm(toeic ~ spanish, data = enrollments)
m2 <- update(m1, ~ . + hs.type)
m3 <- update(m2, ~ . + gender)
m4 <- update(m3, ~ . + enrollment.score)
m5 <- update(m4, ~ . + history)
m6 <- update(m5, ~ . + hs.location)

# Tabulate the results for each model side by side
mtable(m1, m2, m3, m4, m5, m6, sdigits = 3)
##
## Calls:
## m1: lm(formula = toeic ~ spanish, data = enrollments)
## m2: lm(formula = toeic ~ spanish + hs.type, data = enrollments)
## m3: lm(formula = toeic ~ spanish + hs.type + gender, data = enrollments)
## m4: lm(formula = toeic ~ spanish + hs.type + gender + enrollment.score,
## data = enrollments)
## m5: lm(formula = toeic ~ spanish + hs.type + gender + enrollment.score +
## history, data = enrollments)
## m6: lm(formula = toeic ~ spanish + hs.type + gender + enrollment.score +
## history + hs.location, data = enrollments)
##
## ==============================================================================================================================================
## m1 m2 m3 m4 m5 m6
## ----------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 195.090*** 232.126*** 219.250*** 143.336** 141.579** 140.950**
## (23.417) (21.792) (21.952) (48.479) (53.502) (53.460)
## spanish 0.393*** 0.385*** 0.391*** 0.364*** 0.358*** 0.357***
## (0.034) (0.032) (0.032) (0.035) (0.042) (0.042)
## hs.type: public -65.475*** -65.231*** -64.580*** -65.634*** -65.901***
## (4.748) (4.731) (4.743) (5.397) (5.395)
## hs.type: semiprivate -67.402*** -66.281*** -65.244*** -65.104*** -64.033***
## (4.210) (4.204) (4.243) (4.874) (4.905)
## gender: male/female 14.514*** 13.445*** 12.475** 12.184**
## (3.658) (3.707) (4.263) (4.263)
## enrollment.score 0.134 0.141 0.146
## (0.076) (0.088) (0.088)
## history 0.003 0.003
## (0.035) (0.035)
## hs.location: other_region/capital_city -9.270
## (5.015)
## ----------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.063 0.203 0.209 0.210 0.212 0.214
## adj. R-squared 0.062 0.202 0.207 0.208 0.209 0.210
## sigma 86.112 79.455 79.156 79.114 78.046 77.983
## F 130.425 165.453 128.962 103.897 66.557 57.629
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -11489.694 -11331.307 -11323.448 -11321.903 -8626.190 -8624.474
## Deviance 14489469.061 12323068.799 12224441.218 12205141.077 9057579.026 9036802.711
## AIC 22985.387 22672.615 22658.897 22657.806 17268.379 17266.948
## BIC 23002.123 22700.508 22692.369 22696.857 17310.853 17314.731
## N 1956 1956 1956 1956 1494 1494
## ==============================================================================================================================================
Private high schools have a higher median for the ratio toeic/spanish. The variance across the groups seems to be about the same with semiprivate type of high school having the greatest variation for the middle 50% of enrollments.
Holding spanish test results constant, enrollments coming from a private high school get consistently higher toeic results than enrollments coming from a public or semiprivate high school.
Even though the impact of private high schools on toeic results is high, this difference is not so noticeable for enrollments coming from high schools located outside of the capital city.
Yes, I created a linear model starting from the toeic test results and spanish test results.
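The variables in the linear model account for 21.4% of the variance in the toeic test results. Adding hs.type improved the model considerably (R^2 from 0.063 to 0.203), and gender added a little more (0.209). The addition of enrollment.score, history and hs.location improved the R^2 value only marginally (to 0.214), which is consistent with what we saw in the plots. Note that m5 and m6 are fitted on 1,494 observations rather than 1,956, because rows with a missing history score are dropped, so their fit statistics are not strictly comparable with those of m1 to m4.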
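To make the coefficients concrete, here is a worked example that rebuilds the m3 prediction by hand from the values in the model table. `m3_coef` and `predict_m3` are hypothetical helpers written for this illustration; with the fitted model at hand, `predict(m3, newdata = ...)` would be the usual route.

```r
# Coefficients of m3, copied from the mtable() output above.
# Baseline levels are hs.type = private and gender = female.
m3_coef <- c(
  intercept   = 219.250,
  spanish     = 0.391,
  public      = -65.231,
  semiprivate = -66.281,
  male        = 14.514
)

# Hypothetical helper: fitted TOEIC score under m3.
predict_m3 <- function(spanish, hs.type = "private", gender = "female") {
  m3_coef[["intercept"]] +
    m3_coef[["spanish"]] * spanish +
    m3_coef[["public"]] * (hs.type == "public") +
    m3_coef[["semiprivate"]] * (hs.type == "semiprivate") +
    m3_coef[["male"]] * (gender == "male")
}

# Two enrollments with the median spanish score (674).
private_female <- predict_m3(674, "private", "female")
public_male    <- predict_m3(674, "public", "male")
print(c(private_female = private_female, public_male = public_male))
```

With the same spanish score of 674, the model predicts about 482.8 for a female enrollment from a private high school and about 432.1 for a male enrollment from a public one. The residual standard error of m3 (sigma = 79.156) is larger than that difference, so individual predictions remain very uncertain.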
The distribution of the TOEIC results for the enrollments appears to be unimodal, peaking around 475. This is considerably low given that the highest possible score is 990 points.
Enrollments coming from private high schools have the highest median TOEIC result. The variance in TOEIC results is larger for enrollments coming from semiprivate high schools. In the case of public high schools, the variance is lower, similar to private high schools, but the median TOEIC result is lower and almost the same as for semiprivate high schools.
The plot indicates that a linear model could be constructed to predict enrollments’ TOEIC performance using toeic as the outcome variable and spanish as the predictor variable. Holding spanish results constant, enrollments coming from private high schools get consistently higher toeic results than enrollments coming from public or semiprivate high schools.
The enrollments data set contains information on almost 2,000 new students of the Faculty of Business and Economics of the University of Chile enrolled between 2005 and 2010, across 10 variables including scores from the National Higher Education Selection test, as well as variables related to their high schools.
I started by looking at and analysing the behaviour of certain variables within the data set, then I explored some questions of my own interest as I kept making observations on plots. Eventually I explored the TOEIC test results and their relation with the Spanish test scores, and created a linear model to predict TOEIC test results.
There was a clear trend between spanish results and TOEIC results. I was surprised to find that the high school score didn’t influence performance on the TOEIC test, and that it also had a negative correlation with math test results. These two variables carry the largest weights in the enrollment equation, so it seems logical that enrollments would have a high score in one or the other.
I was also expecting that enrollments coming from outside the capital would have lower scores in the TOEIC test, but location turned out to have little effect: the other_region coefficient in m6 (-9.270) is not statistically significant. I also found that attending a private high school had a strong positive effect on TOEIC results, which makes sense in a developing country like Chile.
The first and most obvious limitation is missing variables that are inherent to a person’s background, interests and skills. Other limitations relate to the source of the data: the enrollments cover only the period between 2005 and 2010, and these students reflect only a tiny part of the population’s interests and capabilities. Maybe today linguistic skills are correlated with other things, like social exposure, philosophical knowledge, or something else. At the same time, maybe recent high school graduates who are interested in the arts show a more direct correlation between English skills and, for example, their sense of aesthetics.
In any case, I would be interested in analysing more recent data, and it could be worthwhile to have enrollments from other disciplines take the TOEIC test, which would hopefully increase the model’s accuracy. In that context, and if the trends still hold, this could be a good input for public policy: it could be worthwhile to focus the nation’s educational budget on languages (Spanish and English) rather than just on Spanish.